arXiv:1503.01839v2 [math.NT] 28 May 2015 


THE COMBINATORIAL ALGORITHM FOR COMPUTING tt{x) 


DOUGLAS B. STAPLE 


Abstract. This paper describes recent advances in the combinatorial method 
for computing 7r(x), the number of primes < x. In particular, the memory 
usage has been reduced by a factor of log x, and modifications for shared- and 
distributed-memory parallelism have been incorporated. The resulting method 
computes 7r(a;) with complexity 0{x^^^log~^x) in time and 0{x^^^log^x) in 
space. The algorithm has been implemented and used to compute 7r(10") for 
1 < n < 26 and 7r(2"*) for 1 < m < 86. The mathematics presented here is 
consistent with and builds on that of previous authors. 


1. Introduction 

Algorithms used in exact calculations of 7r(a;) can be divided into roughly three 
categories. The simplest algorithms are based on identifying and counting each 
prime p < x, typically using some modihcation of the sieve of Eratosthenes. A 
naive implementation of the sieve of Eratosthenes uses 0(a;loglogx) arithmetic 
operations and 0{x) bits of memory 0. Modern variants based on bucket siev¬ 
ing reduce the memory usage to roughly '^{■y/x) storage locations each of width 
log 2 bits, while leaving the time complexity unchanged [TH]. Given the prime 

number theorem, algorithms that enumerate the primes p < x are limited to time 
complexity r2(a;/logx). 

The first published algorithm capable of computing 7r(a;) substantially faster 
than the sieve of Eratosthenes was a combinatorial algorithm due to E. Meissel 
M- Given that Meissel’s method involved decisions based on human judgement, 
it is not clear what time complexity to attribute to it; despite this fact, authors 
usually estimate the time complexity of Meissel’s original method as for 

any e > 0 [7]. Meissel used his method in hand calculations of 7r(10®) and 7r(10®) in 
the late 1800s [T11[T^; the method was substantially improved by multiple groups 
of authors, and used in record computations of 7r(10") for 10 < n < 23 between 
1956 and 2007 [TTJ [T31 [3 [TJ [31 IHl dZ]- Meissel’s method and its descendants are 
collectively known as “the” combinatorial algorithm for computing 'k{x). 

Analytic algorithms for computing 'k{x) based on the Riemann zeta function were 
first presented by Lagarias and Odlyzko in the 1980s mmm- Despite the attractive 
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^When discussing space complexity, one must distinguish between bits of storage and storage 
locations, each of which grows as the problem size increases. It is commonplace to state that an 
algorithm has complexity 0(M) in space if it requires 7M storage locations for some constant 
7, each capable of storing a number with log2 M bits OEIEIITT]. We use this convention here 
for consistency with other authors. Similarly, in a model of time complexity, one must specify 
which operations are considered to be performed in constant time. In this paper, we count bitwise 
operations, addition, subtraction, multiplication, division, modulus, decisions (branches), and 
memory read and write operations of a single machine word. 
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complexity of in time and in space, for any e > 0, the implied 

constants were large, and no-one succeeded in developing a practical implementa¬ 
tion of these methods until nearly 30 years later. The first record computation using 
an analytic method was 7r(10^^), under the assumption of the Riemann hypothesis, 
by Franke, Kleinjung, Biithe, dost in 2010 [4]. This was followed by a 2012 compu¬ 
tation of the same value by Platt without assuming the Riemann hypothesis [20) . 
Biithe et al. subsequently modihed their algorithm to eliminate the assumption of 
the Riemann hypothesis, and presented the first computation of 7r(10^^) [3]. 

The problem of determining the number of primes up to some limit is directly 
tied to the history of the primes themselves, which dates to antiquity and is beyond 
the scope of this paper. Thus, the paragraphs above are only a sketch of the history 
of the problem; we direct the interested reader to additional historical references 

[HEIlllolElllg]. 

In the current paper, we describe recent advances to the combinatorial algorithm 
for computing 7r(a;). Firstly, we show how the memory usage of the algorithm can be 
reduced by a factor of log x. We note that this is not only a reduction in the memory 
complexity, but a substantial reduction in the actual memory usage for relevant 
values of x. Indeed, before the final step in the memory-complexity reduction was 
achieved, the author had already reduced the memory usage sufficiently to compute 
7r(10^®), so the original announcement of 7r(10^®) claimed only a constant-factor 
reduction in the memory usage. In addition to the reduction in memory usage, we 
describe mechanisms by which the algorithm can be parallelized. Multiple methods 
due to the author and others are presented for shared-memory parallelism. We also 
describe a previously unpublished algorithm for distributed-memory parallelism, 
loosely based on the idea presented in [5]. The algorithms described here were 
implemented and used to compute 7r(10”) for 1 < n < 26 and 7r(2'^) for I < m < 86. 


2. Reducing space complexity 

Three data structures dominate the memory usage in the combinatorial algo¬ 
rithm mi 0 II1 E]: a table of 7r(y) for y < j/max, a table of the smallest prime 
factor Prainiv)^ also for y < j/max, and a set of 2 ^ sieve counters, where a typical 
choice for L G N is L = [log 2 ?/maxJ [2] . Each of these three data structures limits 
the space complexity of the algorithm to O(ymax)- The choice ?/max = cxx^^^ with 
a = piog^x for some /3 G M is used in the most recent versions of the algorithm 
muz] to achieve the time complexity 0{x‘^^Hog ^x), simultaneously setting the 
space complexity at 0(x^/^log^x). The next largest data structure is a table of 
primes pb for b < 7r(2/„iax), which has size 0{y^^^/ logymax) = 0(x^/^log^x). Thus, 
to decrease the memory usage of the algorithm by a factor of logx, we must either 
reduce each of the limiting data structures by a factor of logi/max or more, or else 
eliminate them entirely. 

We note that not all expositions of the algorithm are limited by all three of the 
above data structures. For example, Oliveira e Silva was aware that significantly 
smaller sieve counters can be used than implied by L = [log 2 ?/maxJ, although he 
does advocate storing 7r(y) and Pm\n{y) for y < t/max [IZ]- This is to be contrasted 
with Deleglise and Rivat, who use ?/max sieve counters, and store 7r(y) for y < ?/max, 
but manage to eliminate Pmin(y) from their hnal formulae [3]. 
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2.1. Retrieving TT{y) for y < i/max in 0(1) time using 0(?/max/log i/max) space. 
The values Tr{y) are used in many places in the algorithm [HITT]. The authors of past 
studies advocate the use of a table of values for this purpose, which requires 0(2/„iax) 
storage locations. The implied constant is 1 in the simplest implementation, where 
a single storage location is used to store a single value of TT{y). This constant can be 
reduced somewhat using a wheel, for example only storing 7r(y) for those y coprime 
to the first c primes, for some c € N. However, a wheel cannot be used to reduce 
the space complexity of the algorithm, as the table used to store the wheel itself 
grows rapidly, namely with the primorial of c. From a practical point of view, even 
with a wheel the table 7r(i/) becomes prohibitively large, and had to be eliminated 
to permit the computation of 7r(10^®). 

Given the prime number theorem, it turns out that it is possible to retrieve 7r(i/) 
for any y < j/max in constant expected time, using only 0(7r(?/max)) = 0{x^^^\og^x) 
precomputed values. The trick is to only store 7r(y) for values y that are multiples of 
[log 2 ymaxj ■ We also make use of a table of all the primes pb for b < 7r(yi„ax): such 
a table also requires 7r(yi„ax) storage locations, and is anyway required elsewhere in 
the combinatorial method [31 [H]. The method for determining TT{y) for a specific 
value of y is then as follows: firstly, we look up the value 7r(^ at the closest value 
y < y. We then iterate through the array of primes pb, starting at & = 7r(y) + 1, 
checking whether pb > y at each value of b. If pb > y, we return ttIjj) = — 1; if 

Pb < y, then we move on to b + 1, repeating the process. 

The surprising thing is the rapid speed with which this algorithm converges: 
from the prime number theorem, we expect on average one prime in the range (y, j/], 
because y—y < Llog 2 2/maxJ • Thus, the most likely situation is that 7r(y) = 7r(?/), i.e., 
the initial guess for 7r(y) is in fact the correct value, and the algorithm terminates 
after a single iteration. In practice, in the combinatorial algorithm we retrieve '!T{y) 
for many values of y, such that the average performance is indeed the relevant 
quantity. Even in the worst case, it is impossible for this algorithm to require more 
than [log 2 ymaxj iterations, which is 0(loga;), because this would contradict the 
assumption that y was the closest value y < y in the table 7r(^. 

2.2. Iterating over the squarefree y < ymax coprime to the first b primes. 

Demanding fast access to Pmin(y) for any y < ymax is equivalent to factoring any 
such value of y on demand. Pmin(y) is accessed sufficiently often that trivial algo¬ 
rithms such as trial factoring are too slow for this purpose. 

The author of m actually advocated storing the values Pmin(y)y(y) for y ^ ymax; 
where /i(y) is the Mdbius function, rather than storing Pmin(y) in isolation. How¬ 
ever, whether Pmin(y) and /i(y) are stored separately or as a product is immaterial 
for the current analysis. The values Pmin(y) require an array of y storage locations, 
each of width at least log 2 ymax; the space required to store /i(y) is negligible by 
comparison. 

As was the case with the array 7r(y), a wheel can be used to compress the 
array ymin(y)- Indeed, the calculation for 7r(10^®) was performed using a wheel to 
compress Pmin(y) and /i(y), see Appendix lAl However, even with a wheel the array 
Pmin(y) eventually becomes prohibitively large, and precluded the computation of 
7r(10^^). Luckily, it turns out that the data structure Pmin(y) can be completely 
eliminated, and /i(y) along with it. In order to do this, we investigate the purpose 
of storing Pmin(y) [12]- In fact, the only situation where this array is used is to 
iterate over all squarefree values y < ymax having Pmin(y) > Pb for different values 
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of 6 < 7r(ymax)- The author of [17] does this by iterating over all y < ?/max, and 
explicitly checking the condition pmin{y) > Pb for the given value of b. We also note 
that /i(y) is used for exactly the same values of y. 

Thus, in order to eliminate the array Pinin(y), we require an iteration scheme 
over the squarefree numbers y < ?/max coprime to the first b primes. Although 
somewhat cumbersome, it is straightforward to construct such an iteration scheme 
using a variable number of nested loops. Firstly, we loop over the primes pb^ < ?/max, 
where bi is the only loop variable, and assumes the values [&+ 1, 7r(?/max)]- We then 
loop over the biprime numbers PbiPb 2 ^ 2 /max, where bi ranges from b + 1 until 
the product pb^^Pbi+i exceeds 2 /max, and &2 ranges from 6 i + 1 until the product 
PbiPb 2 exceeds 2 /max. We subsequently loop over all numbers y that are the product 
of three distinct primes pb^, Pb 2 , and pb^, each having & < &i < &2 < and 
PbiPb 2 Pb 3 < 2 /max, using similar break conditions as above. This process is repeated 
until the largest possible number of factors for y has been exceeded, which occurs 
when 23 b+ipf,+ 2 Pb +3 .. -Pb+n > 2/max, where n is the number of nested loops. For 
example, piP 2 ---PiQ = 2 • 3... 53 > 2®^, so if 2 /max is 64 bits or smaller, then 
n < 16. Furthermore, each value oi y = PbiPb 2 ■ ■ -Pb^ is squarefree by construction, 
so pL{y) = (—1)" for each y. 

2.3. Reducing the size of the sieve counters. Reducing the size of the sieve 
counters is easy in comparison to ‘^{y) and Pmin( 2 /)- Firstly, we note that one can 
simply reduce the number of counters, without negative effects on the runtime m 
By definition, the width of the sieving intervals in the combinatorial algorithm for 
computing tt{x) is equal to the number of sieve counters, which we have denoted 2 ^. 
Given that the upper limit of the sieve is x/ 2 /max, there are a total of cc/(2^ 2 /max) 
intervals. Supposing that the overhead per sieving interval is proportional to the 
number of sieving primes, 7r(2/max), the total overhead associated with subdividing 
the sieving intervals is proportional to x/{2^ logx) by the prime number theorem. 
If the overall time complexity is to be kept at 0(a;^/^log“^x), then this implies 
2^ > log X, for some constant 7 € R. Choosing this minimal value of L 

results in sieve counters a factor of log x smaller than needed to achieve our target 
space complexity of 0(x^^^\o^x). This is consistent with numerical experiments, 
where we find that the optimal value of 2 ^ to minimize the runtime is substantially 
smaller than 2 /max- 

Despite the logger reduction above, there is still an incentive to further reduce 
the size of the sieve counters. Although it is not necessary, it is helpful in a shared- 
memory architecture to allocate separate sieve counters for each parallel thread. 
This permits parallelization at the level of sieving blocks, which is sufficiently coarse 
as to carry relatively little overhead, yet sufficiently fine that load balancing is 
relatively easy. If such an approach is taken, then the memory usage of the counters 
is multiplied by a factor of the number of threads iV, which limits N to log a; or 
smaller if the memory usage is to be kept at 0{x^/^\o^x). 

Two additional approaches for reducing the size of the sieve counters are apparent 
to the author. Firstly, it should be possible to substantially reduce the amount of 
overhead per interval using a variant of the bucket sieve algorithm developed by 
Oliveira e Silva [18] . The basic idea of bucket sieving is to not sieve every interval 
by every sieving prime, but rather to allocate each sieving prime to a “bucket” 
that indicates the next interval in which a multiple of the prime appears. Buckets 
are then sequentially processed, one bucket per interval, with each sieving prime 
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encountered being moved to a later bucket. In this fashion, the only primes that 
are encountered in each sieving interval are the ones for which multiples actually 
appear in that interval. This permits significantly smaller sieving intervals to be 
used, effectively eliminating the width of the sieving interval as a contributor to 
memory usage. Such an approach may even permit the entire sieve table to be 
stored in the processor’s data cache, providing greatly enhanced performance as 
compared to main memory [18) . 

The other potential approach for further reducing the memory usage of the sieve 
counters involves more efficiently packing the values. The sieve counters suggested 
by Oliveira e Silva, and used by the present author, have a fractal-like structure 
m- For a complete description of the workings and necessity of the sieve counters, 
we direct the reader to m- What matters for us is that the counters are each 
initialized with a number 2^, for some 0 < i < L, and then decremented from that 
initial value. This implies that the largest counters need to be stored using integer 
data types with at least L bits. Thus, if a common binary representation is used 
for each of the 2 ^ sieve counters, then the total storage requirement is L2^ bits. 
With the sieve counters indexed using a single variable as in one can probably 
not avoid using a common binary representation for each of the 2^ counters. We 
note, however, that it is possible to pack the values much more efficiently, resulting 
in an average of 2 bits per counter, such that the total requirement is 2^+^ bits. 


3. Modifications for shared-memory parallelism 

There are several practical approaches for parallelizing the algorithm on a shared- 
memory architecture. Firstly, there is one important part of the algorithm, namely 
the “easy leaves” m in the computation of the partial sieve function (j){x, a), which 
can be made embarrassingly parallel. Here (/>(a;, a) denotes the count of natural 
numbers < x that are coprime to the hrst a primes. The so-called easy leaves do 
not depend on the main sieve, do not need to be interleaved with other parts of the 
algorithm, and can be computed completely in isolation of one another. 

The difficult part of the parallelism is the main sieve, where the partial sieve 
function 4‘{m, h) is made available for each m < x/ymax and each prime b < 7r(ymax)- 
The values of ^(m, b) for smaller values of m and b are needed in order to compute 
4>{m, b) for larger m and b, which precludes the embarrassingly parallel computation 
of 4>{rn^b). The approach taken by the current author is to exploit the fact that 
the sieving is already broken into blocks of length 2^. Specifically, one sieves each 
of N subsequent blocks in parallel, working not with (j){m,b), but with (^(m, 6 ) — 
'Pi’iTT-min, b), where TOmin is the beginning of the sieving interval under consideration. 
Each time a value ^(m, b) needs to be added to a running sum without knowledge 
of (j){m^in,b), this discrepancy is recorded in a tally. Once each thread is done 
sieving the interval [mmin, Wmin + 2 ^), the values 4>{m^in + 2 ^, 6 ) — ^(mniin, b) can 
be used to compute each (^{mmimb), starting at the smallest value of mmin, and 
the discrepancies represented by the tallies can be resolved. 

An algorithm that relies on the above idea has several drawbacks. Firstly, sepa¬ 
rate sieve counters are needed for each thread, which multiplies the memory usage 
of the sieve counters by a factor of N. Secondly, the tallies needed to keep track 
of the discrepancies between (f)(rn,b) and (^(m, 6 ) — (/)(mmin,fc) require a similar 
amount of memory as the sieve counters. Finally, synchronization is required after 
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each thread sieves a single block, which carries unnecessary overhead. Nonetheless, 
this approach was found to be efficient enough for the purposes of the author. 

After completing the bulk of the current project, the author was made aware of 
the yet-unpublished work of Kim Walisch. Walisch employs an adaptive algorithm 
for shared-memory parallelism, where blocks are scheduled dynamically depending 
on the runtime of previous blocks. Such an approach is certainly more efficient than 
synchronizing each iteration, which is important if a large shared-memory machine 
is to be used. 

Another potentially attractive approach for shared-memory parallelism, in terms 
of both time and space, would be to combine adaptive scheduling with the distributed- 
memory parallelism algorithm that will be explained in the next section. By lever¬ 
aging a distributed-memory algorithm even on a shared-memory architecture, the 
dependence between subsequent iterations would be broken, completely eliminat¬ 
ing the need for communication between threads. Any constant arrays, such as the 
table of primes pb for b < 7r(?/max), could still be shared between the threads to save 
space on a single shared-memory node. 

4. An algorithm permitting distributed-memory parallelism 

Distributing the computation of tt{x) between multiple compute nodes was nec¬ 
essary for the author to compute 7 r( 10 ^®). The principal issue with distributing the 
computation is that the simplest algorithms described in Section [3] rely on rapid 
exchange of information between compute nodes. Although it is in principle pos¬ 
sible to efficiently distribute such a calculation, the greatest degree of parallelism 
can only be achieved if internode communication can be minimized or eliminated. 

Fortunately, it is possible to parallelize the combinatorial algorithm for comput¬ 
ing 7:{x) in a way that requires no interprocess communication whatsoever, with 
the exception of summing the contribution to 7 r(x) for each job after the fact. This 
is highly efficient for the machine, but requires use of a supporting algorithm to 
break the interdependence of the jobs. 

The following algorithm for distributed-memory parallelism is loosely based on 
an unpublished idea of X. Gourdon [ 6 ] . Specifically, the issue is that the sums in 
the main part of the combinatorial algorithm depend on the partial sieve function 
b), which represents the count of numbers up to m that are coprime to the first 
b primes. Sieving an interval [TOmin,Rimin + 2^) only reveals the values <j)(rn,b) — 
<^(Rimin, b). Thus, determining (^(m, b) requires storing b), updating it after 

sieving each block, and using the updated value while sieving the next block to 
obtain any values (/'(m, b) of interest. This approach works fine if the sieve is 
started at mnUn = 0 , because the recursive dependence terminates with ^( 0 , b) = 0 . 

If the sieve is to be started somewhere in the middle because, for example, earlier 
blocks are being simultaneously sieved on some other computer, then we need a 
method to independently compute 

What is needed is an algorithm that can compute b) for a given value of 
m = mmin and every c < b < 7r(?/inax)- An idea for how to do this was given in [B], 
namely to repeatedly apply the recurrence 

(4.1) b) = 4>(m, 6 - 1) - 4>{m/pb, b - 1). 

Here c is the size of the wheel being used in the sieve, so c) is accessible for 
any m G N in 0(1) time [1^. Given (j){m,c), the idea is to compute (j){m/pc,c) to 
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obtain (j)(m,c+ 1). This can be done using the same implementation intended for 
the overall computation of 7r(x), which is able to compute (p{x, a) for varying values 
of X and a. The process is then repeated, to obtain (j)(rn, c + 2), (p{m, c + 3) and 
onwards up to (j){m,Tr(y^ax))- 

The difficulty with the above idea is the amount of time needed to perform this 
process; it would not affect the overall computational complexity of computing 7r(x), 
but a simple interpretation of this idea was too slow to be used for the computation 
of 7r(10^®). The general idea, however, is sound, and modifications can be made to 
substantially decrease the cost. 

The approach taken here is a multifaceted one, where varying methods are used 
to compute ^(m/pb, b) depending on the values of b. Again, c) is available in 
0(1) time for any m G N using the sieving wheel. The wheel can also be used to 
compute c + 1) in 0(1) time via 4>(m,c) and <^(jnj-paic)- The difficult cases 
occur for c + 2 < 6 < 7r(-y/m). We first check whether If this is the 

case, then we directly apply dS]), using the combinatorial algorithm to compute 
(/)(m/pb, b— 1). If, on the other hand, p^-i > then Legendre’s formula applies, 

such that 4>{m/pb, b — 1) = 7r{mjpb) — b + 2. We next check whether m/pb < Umax- 
If this is the case, then we can use the method described in Section O to retrieve 
Tr{m/pb) in 0(1) time. If m/pb > j/max then Legendre’s formula still applies, but 
we must compute 7r{mjpb) by some other method, e.g., using a second application 
of Legendre’s formula or the combinatorial algorithm. For the remaining values 
Tr{y/m) < b < 7r(?/rnax), determining (()(m, b) is trivial given b— 1). Specifically, 
if m < Umax then b) = 4){m, &—1) —1 for Tr{y/m)+l < b < irirn) and 4>(rn, b) = 1 
for Tr{m) + 1 < 5 < 7r(?/niax). If m > ymax, then (p{m,b) = (j)(rn,b — 1) — 1 for all 
■7r{y/m) + l <b < 7r(yniax). 


5. Numerical results 

The combinatorial algorithm was implemented and used to compute 7r(10") for 
1 < n < 26 and 7r(2™) for 1 < m < 86, see Tables [T] and [2] The values 7r(10") for 
1 < n < 25 and 71(2"*) for 1 < m < 80 were checked and found to be consistent 
with the work of previous authors [m u. We note that the values 7r(2™) for 
m = 77, 78, 79,80 were previously computed under the assumption of the Riemann 
hypothesis [1], and were apparently never verified unconditionally until this study. 
The values 7r(10^®) and 7r(2™) for 81 < m < 86 were first reported in this study. 
These new values were checked in three ways. First, each new value was computed 
twice, using separate clusters and differing numerical parameters (a, c, and L). 
Second, the values were checked against the logarithmic integral to ensure the 
results were reasonable. Third, at the suggestion of Robert Gerbicz, the parities of 
the new values of 7r(a;) were checked and found to be consistent with those computed 
by Lifchitz using a yet-unpublished algorithm [T2] . 

6. Summary 

Recent advances in the combinatorial algorithm for computing Tr(x) were pre¬ 
sented together with numerical results. Specifically, memory usage has been reduced 
by a factor of logo:, and algorithms for shared- and distributed-memory parallelism 
have been developed. The resulting algorithm computes tt{x) using 0{x^^^\og~'^x) 
arithmetic operations and 0{x^^^\og x) memory locations, each of width propor¬ 
tional to logic. An algorithm for shared memory parallelism appeared previously in 
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Table 1. Values of 7r(x) for x = 10" 


X 

7r(a;) 

li(a;) — '!t{x) 

lOi 

4 

2.166 

10^ 

25 

5.126 

lO^ 

168 

9.610 

10^ 

1229 

17.137 

lO'^ 

9592 

37.809 

10® 

78498 

129.549 

10" 

664579 

339.405 

10® 

5761455 

754.375 

10® 

50847534 

1700.957 

IQio 

455052511 

3103.587 

IQii 

4118054813 

11587.622 

1012 

37607912018 

38262.805 

IQi® 

346065536839 

108971.050 

1014 

3204941750802 

314889.954 

IQi® 

29844570422669 

1052618.581 

IQi® 

279238341033925 

3214631.793 

IQi" 

2623557157654233 

7956588.778 

IQi® 

24739954287740860 

21949555.022 

IQi® 

234057667276344607 

99877775.223 

102O 

2220819602560918840 

222744643.548 

1021 

21127269486018731928 

597394254.333 

1022 

201467286689315906290 

1932355208.151 

1023 

1925320391606803968923 

7250186215.780 

1024 

18435599767349200867866 

17146907278.151 

1025 

176846309399143769411680 

55160980939.379 

1026 

1699246750872437141327603 

155891678120.791 


the literature [7], but not for the most recent versions of the algorithm [3 HI]; the 
basic idea necessary for distributed memory parallelism appeared in an unpublished 
manuscript [5]- The memory reduction presented here appears to be new. Previ¬ 
ously reported values pn 11] of 7r(10”) for 1 < n < 25 and 71(2"*) for 1 < m < 80 
were verified; the values 7r(10^®) and 7r(2'") for 81 < m < 86 were computed and 
checked in several ways. 

We are now in the interesting situation where two different types of algorithms, 
combinatorial and analytic, are closely matched for practical calculations of 7r(a;). If 
nothing else, this situation gives unprecedented confidence in any numerical results 
computed consistently using both types of methods, which is currently the case 
with 7r(10") for 1 < n < 25 and 7r(2'^) for 1 < to < 80. 

Appendix A. Implementation details 

The description in m was used as a starting point for the implementation, 
with the enhancements of Sections [2H4| gradually incorporated. The implementa¬ 
tion was written in the C99 programming language, with significant effort devoted 
to ensuring the correctness of the program. Fast unit tests were run on a de¬ 
velopment machine for every committed version of the code, with more extensive 





X 

22 

23 

2 ^ 

2 ® 

2 ® 

2 ^ 

2 ® 

2 ^ 

210 

2 “ 

212 

213 

214 

215 

216 

217 

218 

219 

220 

221 

222 

223 

224 

225 

226 

227 

228 

229 

230 

231 

232 

233 

234 

235 

236 

237 

238 

239 

240 

241 

242 

243 
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Table 2. Values of 7r(x) for a; = 2™ 


7r(a;) 

X 

7r(a;) 

1 

to 

597116381732 

2 

245 

1166746786182 

4 

246 

2280998753949 

6 

247 

4461632979717 

11 

248 

8731188863470 

18 

249 

17094432576778 

31 

250 

33483379603407 

54 

251 

65612899915304 

97 

252 

128625503610475 

172 

253 

252252704148404 

309 

254 

494890204904784 

564 

255 

971269945245201 

1028 

256 

1906879381028850 

1900 

257 

3745011184713964 

3512 

258 

7357400267843990 

6542 

259 

14458792895301660 

12251 

260 

28423094496953330 

23000 

261 

55890484045084135 

43390 

262 

109932807585469973 

82025 

263 

216289611853439384 

155611 

264 

425656284035217743 

295947 

265 

837903145466607212 

564163 

266 

1649819700464785589 

1077871 

267 

3249254387052557215 

2063689 

268 

6400771597544937806 

3957809 

269 

12611864618760352880 

7603553 

270 

24855455363362685793 

14630843 

2^1 

48995571600129458363 

28192750 

272 

96601075195075186855 

54400028 

273 

190499823401327905601 

105097565 

274 

375744164937699609596 

203280221 

275 

741263521140740113483 

393615806 

276 

1462626667154509638735 

762939111 

2^7 

2886507381056867953916 

1480206279 

278 

5697549648954257752872 

2874398515 

279 

11248065615133675809379 

5586502348 

280 

22209558889635384205844 

10866266172 

281 

43860397052947409356492 

21151907950 

282 

86631124695994360074872 

41203088796 

283 

171136408646923240987028 

80316571436 

284 

338124238545210097236684 

156661034233 

285 

668150111666935905701562 

305761713237 

286 

1320486952377516565496055 
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Table 3. Resources usage for computing 7r(a;) with x = 10” 



Time 
[node s] 

Memory 

[bytes] 

Time 
[node s] 

Memory 

[bytes] 

X 

Version 2014.10.19 

Version 2015.01.30 

IQi'^ 

1.48 

X 

10° 

3.81 

X 

107 

1.18 

X 

10° 

1.95 

X 

107 

IQie 

6.07 

X 

10° 

4.31 

X 

107 

5.27 

X 

10° 

2.16 

X 

107 

1017 

2.68 

X 

lOi 

5.78 

X 

107 

2.59 

X 

lOi 

2.71 

X 

107 

IQis 

1.31 

X 

10^ 

1.69 

X 

10® 

1.08 

X 

10^ 

1.01 

X 

10® 

1019 

5.83 

X 

10^ 

3.44 

X 

10® 

6.07 

X 

10° 

1.74 

X 

10® 

1020 

2.89 

X 

10° 

1.73 

X 

10° 

2.56 

X 

10° 

1.27 

X 

10° 

10^1 

1.20 

X 

io4 

3.22 

X 

10° 

1.04 

X 

104 

1.92 

X 

10° 

1022 

5.06 

X 

io4 

5.81 

X 

10° 

4.68 

X 

104 

2.98 

X 

10° 

1023 

2.27 

X 

10° 

1.16 

X 

101° 

2.17 

X 

10° 

5.23 

X 

10° 

10^4 

1.07 

X 

10° 

2.41 

X 

101° 


- 


1.00 

X 

101° 

1025 

5.25 

X 

10° 

5.16 

X 

101° 


- 


2.01 

X 

101° 

1026 

2.98 

X 

107 

1.12 

X 

lOii 


- 


4.16 

X 

101° 


unit tests frequently run on the target cluster. All code was demanded to compile 
without warning using the GCC 4.9.1 compiler with the default warning level, and 
to pass static analysis with the Clang Static Analyzer. Precisions of finite-width 
data types were artificially reduced to intentionally break the program and identify 
failure modes. Unit tests were written covering wide ranges of parameter values, 
including edge-cases chosen specifically with the intention of breaking the program. 
In general, all code was written and checked as strictly as the author was capable 
at the time of writing. 

In Table [3] we show resources usage for computing 7r(10”) using two different 
versions of the author’s implementation of the combinatorial algorithm. The first 
version of the software, 2014.10.19, was missing the advancement presented in Sec¬ 
tion 12.21 this is the version of the software used in the original computations of 
7r(10^®) and 7r(2™) for 81 < m < 86. In this table, time is measured in “node 
seconds”, i.e., it is the sum of the actual time spent on all compute nodes for that 
calculation. Similarly, memory usage is memory per node. Here a “compute node” 
was an IBM iDataplex dx360 M4, having a total of 16 CPU cores (2 x Intel Xeon 
E5-2670 eight-core 2.60 CHz CPUs) with either 64 or 128 GB RAM (8 GB PC3- 
12800 ECC RDIMM modules) depending on the requirements of the calculation. 
Thus, 2.98 X 10^ node s for computing 7r(10^®) corresponds to roughly 15.1 CPU 
core-years. 
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